1.0 Technical Progress

1.1  Introduction 

The effort for July 1963 was in the areas of:

1. Reliability Design for Computer Organizations, and

2.  Reliability  Programming. 

1.2 Reliability Design for Computer Organizations

Work on the sequences required for self-reorganization has been completed
and documented as Attachment A to this report. Covered therein are program
sequences after minor failures and master machine sequences after major
failures. Also, the physical characteristics of the master and its
interconnection with the general-purpose machine are discussed.

A method for increasing the reliability of self-reorganizing electronic
digital computers by employing techniques for tolerating common component
failures in the basic flip-flop has been developed. This resulted from a study
of component reliability and the effect of component failures on circuit
operation in light of the logical functions to be performed. It was shown that
approximately 80% of all component malfunctions in flip-flop circuits occur in
one of the output amplifiers, leaving the other output and the bistable portion
in operable condition. Using this fact, gating arrangements and programming
techniques were developed for tolerating such faults.

As  shown  on  the  milestone  chart,  work  on  this  task  has  been  completed. 

1.3  Reliability  Programming 

It has been discovered that the Set State Register instruction in the
self-reorganizing machine has the effect of creating additional paths for
information flow when used properly. That is, when reorganization occurs, the
information in a register which is moved in the organization moves with the
register. Thus, information has moved thru a new path, which allows logic that
was classified as essential to proper operation to be reclassified as
nonessential.

The  only  logic  or  information  paths  which  can  be  utilized  are  those 
which  are  under  program  control  and,  thus,  the  concept  of  microprogramming  may 
be  viewed  in  a  new  light. 

It has been determined that the MTBCF to MTBF ratio of the sample machine
is raised to 2.7 from 2.3 when viewed in this new manner. In addition, it is
estimated that microprogram techniques applied to this machine will raise the
ratio to 3.3. With optimum application of these ideas, it is hoped that a ratio
of 4 is possible.

The sacrifice for this advantage is a small amount of memory and speed.
The latter is only sacrificed after a logic failure.

A detailed development of a unified error-detecting, correcting process
is still underway. The general approach previously described has evolved into a
more specifically described process as a result of a study of certain fundamental
techniques. A combination of these techniques, which will provide an effective
error-detecting process, is presently being described.

It was necessary to generate a complete classification of errors upon
which to base the detailed techniques. The study has included many and varied
items, such as the dependency of the detection method on time and on the
functional location of the fault in the machine. The reasonableness check has
been studied in some detail along with schemes for testing the memory. Other
items include using alternate (or repetitive) computational methods; "offset
checking;" short test problems; program check points; checking all possible
commands with all possible options; multiplication as a checking tool; tests
based on symbolic logic statements; testing the accumulator and adder; hardware
checking devices; "address coding;" testing the addressable devices; prevention
of excess looping; forced periodic branching; and error counters.

Emphasis has been placed on analyzing the effectiveness of the detecting,
correcting process, both during fault-free operation and operation after
error detection.

1.4 Follow-On Effort


Work in the above area of reliability programming will continue. Work
in the areas of recommended follow-on effort and on the final report will
commence.


1.5  Action  Items 


None. 

2.0  Trips  and  Visits 


On 22 July 1963, Mr I Terris of HAC visited Mr J Y Miyamoto of Aerospace
Corporation to discuss the technical progress of this study contract.
3.0 Financial, Manpower and Milestone Status
See  next  page. 

3.3 Assigned Personnel

J  L  Droyer,  Member  of  Technical  Staff 
C  A  Finnila,  Member  of  Technical  Staff 
I  Terris,  Project  Engineer 

3.4  Additional  Information 

The  dollar  and  manpower  data  used  in  the  "Actuals"  columns  of  this 
report  were  taken  from  Hughes  Aircraft  Company  Financial  Report 
Number  980-68  dated  28  July  1963. 

3.5  Milestone  Chart 

The  Milestone  Chart  can  be  found  on  the  last  page  of  this  report. 
ATTACHMENT A


SELF-REORGANIZING  MACHINES  -  METHODS  OF  REORGANIZATION 


1. Introduction


In a previous report*, the concept of self-reorganizing machines which
eliminate unreliable components from the paths of information flow was introduced.

The previous report was concerned with the mean time between catastrophic failures
(MTBCF) and, by considering the hardware in the main path of the reliability
diagram (essential equipment), it was demonstrated for a sample computer that
self-reorganization increases the MTBCF by a factor of 2.3.

Previously, only the reorganization logic and equipment were explored. The
sequences or methods of accomplishing self-reorganization were not covered,
except to say that they would be discussed later.

In this report, the methods or sequences required for machine
self-reorganization are discussed and applied to the model machine of the
previous report.

2.  General  Philosophy 

Malfunctions may be classified in many manners; e.g., by duration or by effect.
For the purposes of this report, classification by effect on operability of the
machine is most appropriate. That is, malfunctions shall be divided into two
classes:


a.  Minor  -  those  which  leave  the  machine  almost  operating  properly,  and 

b.  Major  -  those  which  do  not. 

Minor malfunctions leave the machine in a state where self-reorganization is
possible, while major malfunctions leave the machine in a state from which no
self-reorganization or diagnosis routines could be run. For recovery from the
latter type of malfunction, it is necessary to have a master unit which assumes
command of the machine, diagnoses the failure, compensates for the malfunction,
and returns the machine to the self-operative state. Monitoring of the machine
by a master is on a continual or periodic basis, depending on the allowable
downtime.


Since most of the master machine is not required for successful operation of
the general-purpose machine, the reliability of a majority of the master does not
enter into the first order computation of the MTBCF. Only certain failure modes
in the circuits of the master enter into the first order approximation of the
MTBCF. Failures in a majority of the master may be catastrophic only if a major
malfunction also occurs in the general-purpose computer. However, this requires
two failures and only appears in the higher order terms of the MTBCF computation.


*Attachment A of Monthly Progress Report #6, or IDC 292b, 01/10, Techniques for
Improving Computer Reliability by Logical Design (Self-Reorganizing Machines)


The over-all program flow chart for a self-reorganizing computer is shown in
Figure 1. The programs are grouped as:

a.  Tactical, 

b.  Test, 

c.  Diagnosis,  and 

d.  Self-reorganizing. 

A control program governs entry to the various tactical programs and generates
an output to the master. This output is used by the master to detect major
malfunctions. The control program determines entry to the various tactical
routines and the test routine, which must be entered periodically. Each tactical
routine may have a reasonableness check to aid in error detection. This check
may be used to determine actions of the control and test programs.

The test routine has an output to the master to aid in major malfunction
detection, and it tests the integrity of the computer by typical self-test
routines.

When minor malfunctions are detected, the diagnosis routine is used to
pinpoint the malfunction, if possible, and determine whether self-reorganization
is desired. If self-reorganization is desired, the self-reorganization routine
is run.


In  the  case  of  a  major  malfunction,  the  computer  may  not  provide  the  proper 
outputs  to  the  master.  In  this  event,  the  master  performs  the  reorganization  via 
the  state  register  and  starts  the  computer  at  the  beginning  of  the  test  routine 
as  shown  in  Figure  1. 
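
The program grouping described above can be illustrated with a minimal sketch. It is not taken from the report or from Figure 1; the class name, the callables, and the test_period parameter are all assumptions introduced only to show how a control program might interleave tactical routines, the periodic test routine, diagnosis, and self-reorganization while producing outputs for the master.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Computer:
    """Toy model of the program flow; every name here is hypothetical."""
    tactical: List[Callable[[], bool]]   # each returns True if its reasonableness check passes
    self_test: Callable[[], bool]        # returns True if no minor malfunction is found
    diagnose: Callable[[], bool]         # returns True if self-reorganization is desired
    reorganize: Callable[[], None]       # the self-reorganization routine
    test_period: int = 3                 # assumed: force the test routine every N tactical runs
    master_outputs: List[str] = field(default_factory=list)

    def run_test(self) -> None:
        # The test routine produces its own output to the master.
        self.master_outputs.append("test")
        if not self.self_test():
            # Minor malfunction detected: diagnose, then reorganize if desired.
            if self.diagnose():
                self.reorganize()

    def control_cycle(self, cycles: int) -> None:
        runs_since_test = 0
        for _ in range(cycles):
            for routine in self.tactical:
                # The control program produces an output to the master.
                self.master_outputs.append("control")
                check_ok = routine()
                runs_since_test += 1
                # Enter the test routine periodically, or early when a
                # reasonableness check fails.
                if not check_ok or runs_since_test >= self.test_period:
                    self.run_test()
                    runs_since_test = 0
```

In this sketch a failed reasonableness check sends control to the test routine at once; the report only says the check "may be used to determine actions of the control and test programs", so that policy is one possible reading.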

3.  Reorganization  Sequences 

As indicated above, there are two reorganization sequences, one for minor
failures in which the machine still can perform its program, and a second for
major failures in which the machine no longer can function properly. Generally
speaking, malfunctions in the arithmetic unit, input-output unit, and parts of
the control unit are minor malfunctions and leave the machine in an operational
state. Malfunctions in the memory, most of the control unit, and power supply
are generally of the major category and render the machine inoperative. In this
latter case, the master unit assumes command and reorganizes the machine so that
it can operate again. The master machine detects a major malfunction by watching
outputs of the machine which must be varied in a predetermined manner by a
positive action of the machine to prevent take-over by the master machine. In
the event of a major machine malfunction, there is a high probability that the
outputs are not properly presented to the master.

3.1 Reorganization Sequences after a Minor Malfunction:



A minor malfunction may be detected by built-in hardware checks such as
a parity check or instruction tag bits to insure program sequence, or by a
programmed check such as a reasonableness check or an alternate computation.
After a minor malfunction is detected, a diagnosis routine determines whether
self-reorganization is required and where. Examining the sample machine in the
referenced report, it is determined that malfunctions in the arithmetic unit,
input-output unit, and the index register, shift counter, and spare counter of
the control unit would be considered minor malfunctions.

In this report, malfunction detection and classification shall not be
discussed. The self-reorganization program causes self-reorganization by setting
the proper state flip-flops as discussed in the previous report.

The self-reorganization routine is entered from the diagnosis routine,
which has localized the malfunction and determined that self-reorganization is
required. To understand the construction of the self-reorganization routine, it
should be realized that this routine must be run, at least in part, on a
malfunctioning machine. Thus, the routine is constructed so that it is set up
prior to malfunction occurrence and can perform any anticipated reorganization
without use of those portions of the machine in which a minor malfunction can
occur, as shown in Figure 2. After reorganization, the routine would be set up
for reorganization after a second failure.

The above is accomplished by the use of a reorganization routine which
for the first portion has only SSR (Set-State Register) and UCJ (Unconditional
Jump) instructions. These do not use any equipment in which a permanent minor
malfunction may occur. The diagnostic enters at the proper SSR instruction and
execution of one or several of these instructions performs the needed
self-reorganization. Each SSR set is followed by a UCJ to the second half of the
routine, which modifies the SSR instructions of the first part for a later
possible reorganization. The modification of the SSR instructions, which
requires use of the arithmetic unit, is required for choice of an alternate
register in the sample computer.


As an example, let us consider the loss of the index register (XR).
The first alternate would be the least significant half of the Q register and
the second alternate the least significant half of the A register. At the start,
the self-reorganization routine for index register replacement would be:


    1.  SSR  22  1  -  Set State Register 22 to 1
    2.  SSR   2  0  -  Set State Register 2 to 0
    3.  SSR   1  0  -  Set State Register 1 to 0
    4.  UCJ  ...    -  Unconditional Jump (to second half of routine)

The first action causes the splitting of the arithmetic unit and sets up
most-significant-half arithmetic. The next two SSR instructions pick Q LSH for
the XR replacement. Now, we modify instruction 2 in the second half of the
routine to be


SSR  2  1 

so that if Q LSH fails, repetition of this same portion of the routine
(replacement of XR) will result in A LSH being used for the XR. Note that the
diagnostic enters the self-reorganization routine at the same place on a second
failure of the XR function, and thus, itself, does not require any modification.
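
The self-modifying behavior of this routine can be followed in a small simulation. This is only a sketch under stated assumptions: the tuple encoding of instructions, the function names, and in particular the decoding of state flip-flops 22, 2, and 1 in xr_source are hypothetical, chosen so that the two alternates named in the text (Q LSH first, then A LSH) come out in that order.

```python
def initial_program():
    """First half of the XR-replacement routine: SSR and UCJ instructions only."""
    return [
        ("SSR", 22, 1),                # split the arithmetic unit
        ("SSR", 2, 0),                 # with the next instruction, pick Q LSH
        ("SSR", 1, 0),
        ("UCJ", "second_half", None),  # jump target name is a placeholder
    ]


def run_xr_replacement(program, state):
    """Run the first half, then the second half's patch of instruction 2."""
    for op, a, b in program:
        if op == "SSR":
            state[a] = b               # set state flip-flop a to value b
        elif op == "UCJ":
            break                      # leave the first half
    # Second half: executed after reorganization, so the arithmetic unit may
    # be used. Patch instruction 2 so a repeat selects the next alternate.
    program[1] = ("SSR", 2, 1)


def xr_source(state):
    """Assumed decoding of the state register; not given in the report."""
    if state.get(22, 0) == 0:
        return "XR"
    return "A LSH" if state.get(2, 0) == 1 else "Q LSH"


program = initial_program()
state = {}
run_xr_replacement(program, state)     # first XR failure
first_choice = xr_source(state)
run_xr_replacement(program, state)     # second failure of the XR function
second_choice = xr_source(state)
```

The diagnostic calls the same entry point both times; only the stored SSR operand changes between the two runs, which is the property the text points out.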

3.2 Sequences for Reorganization after Major Malfunction:

As defined above, a major malfunction would be such that the computer
would fail to operate in a nearly normal manner. That is, the computer would be
incapable of running a self-diagnosis and causing reorganization to permit
correct computational operation. In this case, the positive action required to
prevent the master unit from taking over operation would no longer be performed
and the master unit would come in to perform the necessary detection, diagnosis,
and reorganization.

4. Master Unit


To gain a feeling for the size and complexity of the master unit, we look at
the functions which it performs, the manner in which they are performed, and the
required implementation.

Referring to the sample computer in the reference, it is seen that the
components which can cause major malfunction are the bit and word counter, the
program counter, the instruction register, the memory address register, the
memory, the memory register, the logic, and the clock pulse generator. It is
obvious that some malfunctions would cause the computer to lose all computation
capability, others result in a partial but major loss, and others result in a
complete stoppage of the machine.

As previously mentioned, the master monitors the computer by watching the
setting of certain outputs as shown in Figure 1. If these outputs are not
manipulated in a prescribed manner, the master assumes a major malfunction has
occurred.

Although no mention was made of the clock pulse supply for the sample
computer, it is reasonable to assume that if an alternate supply were available,
one would monitor the clock pulses. (These clock pulses are fed to the various
registers, each of which has a gated clock pulse amplifier, and loss of the
clock pulses completely stops the machine.)

Detection of clock pulse generator failure may be by monostable
multivibrators, one for each generator, which would be allowed to reset when
clock pulses have been missing for some predetermined time. The resetting of a
multivibrator would cause the other generator to be used as the clock pulse
source.
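
The clock monitoring just described can be sketched in discrete time. This is an illustration only, not the circuit of the report: the class and function names and the timeout value are assumptions, and the rule that the switch is made only when the other generator's multivibrator has not also reset is an added safeguard that the text does not state.

```python
class OneShot:
    """Retriggerable monostable: stays set while pulses keep arriving."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.remaining = timeout

    def tick(self, pulse):
        """Advance one time step; return True once the one-shot has reset."""
        if pulse:
            self.remaining = self.timeout      # retriggered by a clock pulse
        else:
            self.remaining = max(0, self.remaining - 1)
        return self.remaining == 0


def select_clock(pulses_a, pulses_b, timeout=3):
    """Return which generator ('A' or 'B') is the clock source at each step."""
    shots = {"A": OneShot(timeout), "B": OneShot(timeout)}
    active = "A"
    history = []
    for pulse_a, pulse_b in zip(pulses_a, pulses_b):
        has_reset = {"A": shots["A"].tick(pulse_a), "B": shots["B"].tick(pulse_b)}
        other = "B" if active == "A" else "A"
        # Switch when the active generator's one-shot has reset, provided the
        # other generator still appears alive (assumed safeguard).
        if has_reset[active] and not has_reset[other]:
            active = other
        history.append(active)
    return history


# Generator A stops after two pulses; B keeps running.
history = select_clock([1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1], timeout=3)
```

With a timeout of three steps, the switch to generator B occurs on the third step without a pulse from A, so the machine is stopped only for that predetermined interval.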



4.2 Loss of Program Sequence


Loss of program sequence may be caused by either a permanent or
nonpermanent malfunction, and may or may not result in a major malfunction. One
could envision a nonpermanent malfunction causing data to be interpreted as an
order and the computer to enter a loop of meaningless calculation, thus
generating a major malfunction. On the other hand, a permanent malfunction may
affect only certain instructions and result in a loss of program sequence which
is minor.

Program sequence may be monitored externally by the use of outputs which
must be varied in a prescribed manner by the program. How many outputs are used
and the complexity of their variation depend on the machine and the required
probability of malfunction detection. The latter requirement is mission and
time-dependent.


A simple scheme would use one output with a monostable multivibrator
which is hit every time the test routine is run and a second output which is hit
with a prescribed minimum frequency by the other programs. This scheme may be
enlarged by use of more outputs. The two-output scheme guarantees at least
periodic entry to the test routine and periodic entry to a tactical routine.
The test routine should be complete enough to guarantee correct operation if
entry is achieved and negate the need for more outputs.
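
A minimal sketch of the two-output scheme follows. The function name, the representation of each time step as a set of outputs that were hit, and the two timeout values are assumptions made for illustration; the report does not specify timing.

```python
def first_takeover(events, test_timeout, tactical_timeout):
    """Return the step at which the master would take over, or None.

    events: one set per time step, containing "test" and/or "tactical" for
    each output hit by the program during that step (hypothetical encoding).
    """
    since_test = 0
    since_tactical = 0
    for step, hits in enumerate(events):
        since_test = 0 if "test" in hits else since_test + 1
        since_tactical = 0 if "tactical" in hits else since_tactical + 1
        # Either one-shot timing out signals loss of program sequence.
        if since_test >= test_timeout or since_tactical >= tactical_timeout:
            return step
    return None


# Normal operation: tactical routines run, with the test routine entered periodically.
healthy = [{"tactical"}, {"tactical"}, {"test"}, {"tactical"}] * 3
# The program hangs after step 1 and hits neither output again.
stalled = [{"tactical"}, {"test"}] + [set()] * 5
# The program loops in tactical code and never reaches the test routine.
no_test = [{"tactical"}] * 6

healthy_result = first_takeover(healthy, test_timeout=4, tactical_timeout=3)
stalled_result = first_takeover(stalled, test_timeout=4, tactical_timeout=3)
no_test_result = first_takeover(no_test, test_timeout=4, tactical_timeout=3)
```

The third case shows why two outputs are used: a loop that keeps hitting the tactical output is still caught, because the test output goes quiet.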

If program sequence is lost, the master machine starts the computer at
the beginning of its test routine. If the malfunction was nonpermanent,
operation will continue in the normal manner. If the malfunction was permanent
and the machine is capable of operation (minor malfunction), the test routine
will reorganize the machine. If the malfunction is permanent and major, program
sequence will be lost again and the master will perform the necessary
reorganization. The manner in which this reorganization is performed depends
upon the complexity of the master. If the master has no diagnostic capability,
reorganization can be by trial and error. Since there are only eight registers
which can cause a major malfunction, repair could be achieved very quickly on a
trial and error basis if only one register fails at a time. After each
reorganization trial the PC is set to the address of the first order in the
test routine and the machine is set free.
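
The trial-and-error procedure can be sketched as a simple loop. Several things here are assumptions rather than statements of the report: the unit list takes the eight components named earlier as the "eight registers", each trial is assumed to replace exactly one unit (an unsuccessful trial being undone before the next), and keeps_sequence stands in for the master observing whether the outputs are again varied correctly after the restart.

```python
# Assumed to be the eight units meant by the text; taken from the component
# list given earlier in this section.
MAJOR_UNITS = [
    "bit and word counter",
    "program counter",
    "instruction register",
    "memory address register",
    "memory",
    "memory register",
    "logic",
    "clock pulse generator",
]


def trial_and_error(units, keeps_sequence):
    """Try replacing one unit at a time until program sequence is held.

    keeps_sequence(unit) is a hypothetical callback: True if, with `unit`
    replaced and the PC set to the first order of the test routine, the
    freed machine keeps its outputs to the master correct.
    Returns (unit, trials) on success or (None, trials) if every trial fails.
    """
    trials = 0
    for unit in units:
        trials += 1
        # Master sets the state register to replace `unit`, sets the PC to
        # the start of the test routine, and sets the machine free.
        if keeps_sequence(unit):
            return unit, trials
    return None, trials


failed_unit = "memory"
repaired, trials_needed = trial_and_error(MAJOR_UNITS, lambda unit: unit == failed_unit)
```

Under the single-failure assumption of the text, the loop ends after at most eight trials.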

After reorganization is achieved, the self-reorganization routine checks
the state of the state register and updates itself.

From the above description, it is seen that the master may consist of
a counter which is used to sequence itself thru a short wired program and some
one-shots, as shown in Figure 3.